Reshaping data

Reformatting data is an essential data science skill.

Let’s generate a data set.

expr.data <- matrix(sample(1:10, 20000, replace=TRUE), nrow = 5000, ncol = 4)
rownames(expr.data) <- paste0("GENE", 1:5000)
colnames(expr.data) <- paste0("SAMPLE", 1:4)
head(expr.data)
##       SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4
## GENE1       1       3       5       4
## GENE2      10       3       9      10
## GENE3       7       3       3       2
## GENE4       9       1       3       3
## GENE5       3       5      10       4
## GENE6      10       7       2      10

This data is GENES (in rows) by SAMPLES (in columns).

The machine learning community has a general convention: SAMPLES go in rows and FEATURES go in columns.

Transposing

Switching rows and columns is easy: the transpose function, t(), does this.

expr.data.transpose <- t(expr.data)
head(expr.data.transpose[,1:10])
##         GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10
## SAMPLE1     1    10     7     9     3    10     3     4     4      8
## SAMPLE2     3     3     3     1     5     7     2     7     6     10
## SAMPLE3     5     9     3     3    10     2     2    10     4      4
## SAMPLE4     4    10     2     3     4    10     1     8     3      3

This has created data in wide format. Each row is a SAMPLE that contains every FEATURE measurement.
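A quick way to convince yourself that t() only swaps the two dimensions (a small check, not part of the original analysis):

```r
# Transposing swaps the dimensions; transposing twice returns the original
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)                 # 2 3
dim(t(m))              # 3 2
identical(t(t(m)), m)  # TRUE
```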

Melting and Pivoting

The alternative is long format. In long format, each row is a measurement of one feature for one sample. This is not the same as the transposition: in the transpose, each row is still one sample holding measurements for every feature.

We can convert wide data to long data with the melt function from the reshape2 package. Melt converts the rownames and colnames so that each row has the format SAMPLE|FEATURE|MEASUREMENT.

# Melt knows to melt on row and column names for a MATRIX
# Melt is not smart enough to figure out what it should do on a DATA.FRAME
# Additional options (like `id.vars`) must be set if you melt a DATA.FRAME

expr.data.melted <- reshape2::melt(expr.data.transpose)
head(expr.data.melted)
##      Var1  Var2 value
## 1 SAMPLE1 GENE1     1
## 2 SAMPLE2 GENE1     3
## 3 SAMPLE3 GENE1     5
## 4 SAMPLE4 GENE1     4
## 5 SAMPLE1 GENE2    10
## 6 SAMPLE2 GENE2     3
colnames(expr.data.melted) <- c("Sample", "Gene", "Expression")
head(expr.data.melted)
##    Sample  Gene Expression
## 1 SAMPLE1 GENE1          1
## 2 SAMPLE2 GENE1          3
## 3 SAMPLE3 GENE1          5
## 4 SAMPLE4 GENE1          4
## 5 SAMPLE1 GENE2         10
## 6 SAMPLE2 GENE2          3
# A melted MATRIX will always become a DATA.FRAME
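As the comments above note, melting a DATA.FRAME requires extra options. Here is a minimal sketch, using a small hypothetical data frame (not part of the expression data set), where id.vars must be supplied by hand:

```r
library(reshape2)

# A tiny hypothetical wide data.frame: one id column plus two measurement columns
df <- data.frame(Sample = c("SAMPLE1", "SAMPLE2"),
                 GENE1  = c(5, 7),
                 GENE2  = c(2, 9))

# id.vars names the column(s) that identify each row;
# every other column is stacked into variable/value pairs
df.long <- melt(df, id.vars = "Sample",
                variable.name = "Gene", value.name = "Expression")
df.long
```

Naming the output columns up front with variable.name and value.name avoids the rename step needed after melting a matrix.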

Sometimes you will have long format data which needs to be made wide. This is often called pivoting the data, and is a job for the acast function in the reshape2 package.

# You always have to tell acast exactly what to do
# It needs to know which column in your long data should be expanded into multiple columns in your wide data
# And which column in your long data has rownames
# You can tell it with ROWNAMES ~ COLUMNNAMES as the second argument

expr.data.pivoted <- reshape2::acast(expr.data.melted, Sample ~ Gene, value.var = "Expression")
head(expr.data.pivoted[, 1:10])
##         GENE1 GENE2 GENE3 GENE4 GENE5 GENE6 GENE7 GENE8 GENE9 GENE10
## SAMPLE1     1    10     7     9     3    10     3     4     4      8
## SAMPLE2     3     3     3     1     5     7     2     7     6     10
## SAMPLE3     5     9     3     3    10     2     2    10     4      4
## SAMPLE4     4    10     2     3     4    10     1     8     3      3
# acast produces a MATRIX
# dcast is the same type of function; dcast produces a DATA.FRAME
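A minimal sketch of the dcast variant, on a small hypothetical long data frame shaped like the melted data above. The formula is the same; the practical difference is that the row identifier becomes an ordinary column instead of rownames:

```r
library(reshape2)

# Hypothetical long data, shaped as melt would produce it
long <- data.frame(Sample = c("SAMPLE1", "SAMPLE1", "SAMPLE2", "SAMPLE2"),
                   Gene = c("GENE1", "GENE2", "GENE1", "GENE2"),
                   Expression = c(5, 2, 7, 9))

# Same ROWNAMES ~ COLUMNNAMES formula as acast, but the result is a
# data.frame: Sample is a column rather than the rownames
wide <- dcast(long, Sample ~ Gene, value.var = "Expression")
wide
```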

Probability functions

Probability Mass Function (D function)

R has functions that calculate probabilities. You don’t need to work through the binomial theorem by hand.

\[ p(x) = \binom{n}{x} p^{x}(1-p)^{(n-x)} \]

This is the binomial probability mass function (PMF). It is calculated with the dbinom function.

# The probability of EVENT is p. By default, the probability of NOT EVENT is (1-p).
p <- 0.8

# The number of times EVENT occurs
x <- 2

# The number of trials
n <- 3

# Solve the binomial probability mass function with these numbers
dbinom(x, n, p)
## [1] 0.384

That’s the probability the EVENT will happen 2 times out of 3 possible trials.
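As a sanity check, dbinom agrees with writing the PMF out directly with choose():

```r
p <- 0.8  # probability of EVENT
x <- 2    # number of EVENTs
n <- 3    # number of trials

# (n choose x) * p^x * (1-p)^(n-x), computed by hand
by.hand <- choose(n, x) * p^x * (1 - p)^(n - x)
by.hand          # 0.384
dbinom(x, n, p)  # 0.384
```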

What about 50 trials? What are the chances for each possible outcome?

trial.data <- NULL
for (i in 0:100) {
  p <- i / 100
  trial.data <- rbind(trial.data, data.frame(Trial=0:50, P=round(p, 4), Px=dbinom(0:50, 50, p)))
}

library(ggplot2)
library(gganimate)

ggplot(trial.data, aes(x=Trial, y=Px)) +
  geom_line() +
  theme_classic() +
  scale_x_continuous(breaks=0:50) +
  coord_cartesian(ylim=c(0, 0.25)) +
  labs(title = 'Binomial Probability Mass Function [p = {frame_time}]', x = 'EVENT Occurs per 50 Trials', y = 'Probability') +
  theme(axis.text.x = element_text(angle=90, hjust=0.5, vjust=1), title = element_text(face='bold', size = 14)) +
  transition_time(P)

Cumulative Distribution Function (P function)

For 50 trials, what are the chances that at most 10 EVENTS will occur? That’s the job for a Cumulative Distribution Function (CDF).

R has a function for that too.

# The probability of EVENT is p. By default, the probability of NOT EVENT is (1-p).
p <- 0.8

# The number of times EVENT occurs
x <- 10

# The number of trials
n <- 50

pbinom(x, n, p)
## [1] 1.290837e-19
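The CDF is just the PMF summed from 0 up to x, which makes pbinom easy to verify against dbinom:

```r
p <- 0.8
n <- 50
x <- 10

# P(X <= x) is the sum of P(X = 0), P(X = 1), ..., P(X = x)
sum(dbinom(0:x, n, p))
pbinom(x, n, p)
```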

What about for every possible outcome?

cdf.trial.data <- NULL
for (i in 0:100) {
  p <- i / 100
  cdf.trial.data <- rbind(cdf.trial.data, data.frame(Trial=0:50, P=round(p, 4), Px=pbinom(0:50, 50, p)))
}

ggplot(cdf.trial.data, aes(x=Trial, y=Px)) +
  geom_line() +
  theme_classic() +
  scale_x_continuous(breaks=0:50) +
  coord_cartesian(ylim=c(0, 1)) +
  labs(title = 'Binomial Cumulative Distribution Function [p = {frame_time}]', x = 'EVENT occurs no more than X times in 50 trials', y = 'Probability') +
  theme(axis.text.x = element_text(angle=90, hjust=0.5, vjust=1), title = element_text(face='bold', size = 14)) +
  transition_time(P)

Quantile Function (Q function)

For 50 trials, 80% of runs are expected to have more than some number of EVENTs. What is that expected number?

R has a function for that too.

# The probability of EVENT is p. By default, the probability of NOT EVENT is (1-p).
p <- 0.8

# The quantile probability: qbinom returns the smallest X such that P(at most X events) >= q (q = 0.2 means ~80% of runs will have more than X events)
q <- 0.2

# The number of trials
n <- 50

qbinom(q, n, p)
## [1] 38
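qbinom is (roughly) the inverse of pbinom: it returns the smallest number of events x such that pbinom(x, n, p) >= q. A quick check of that definition:

```r
p <- 0.8
n <- 50
q <- 0.2

x <- qbinom(q, n, p)  # 38
# By definition: the CDF reaches q at x, but not at x - 1
pbinom(x, n, p) >= q      # TRUE
pbinom(x - 1, n, p) < q   # TRUE
```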
q.trial.data <- NULL
for (i in 0:100) {
  p <- i / 100
  iqr <- qbinom(0.75, 50, p) - qbinom(0.25, 50, p)
  q.trial.data <- rbind(q.trial.data, data.frame(P=round(p, 4), 
                                                 Pq25=qbinom(0.25, 50, p), 
                                                 Pq5=qbinom(0.5, 50, p),
                                                 Pq75=qbinom(0.75, 50, p),
                                                 Ymin=max(0, qbinom(0.25, 50, p) - (1.5 * iqr)),
                                                 Ymax=min(50, qbinom(0.75, 50, p) + (1.5 * iqr))))
}

ggplot(q.trial.data) +
  geom_boxplot(aes(x=1, lower=Pq25, middle=Pq5, upper=Pq75, ymin=Ymin, ymax=Ymax, group=P), stat="identity") +
  theme_classic() +
  coord_cartesian(xlim=c(0,2), ylim=c(0,51)) +
  labs(title = 'Binomial Quantile Function [p = {frame_time}]', x = '', y = 'Successes') +
  theme(axis.text.x = element_blank(), title = element_text(face='bold', size = 14)) +
  transition_time(P)

Random Outcome Function (R function)

Sometimes you just want to sample from a distribution: give it some parameters and get a random result out.

R has a function for that too.

# The probability of EVENT is p. By default, the probability of NOT EVENT is (1-p).
p <- 0.8

# The number of random draws; each draw runs n trials and reports how many EVENTs occurred
x <- 10

# The number of trials
n <- 50

rbinom(x, n, p)
##  [1] 39 41 38 42 40 40 37 38 34 42
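Because the output is random, your numbers will differ from those above unless you fix the seed. Over many draws the results average out to n * p (a small check, with set.seed added for reproducibility):

```r
set.seed(42)  # not in the original; added so the draws are reproducible
draws <- rbinom(100000, 50, 0.8)

# The sample mean should be close to the expected value n * p = 40
mean(draws)
```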